
PATENT APPLICATION 
EXPRESS MAIL No. EL841974388US 
ATTORNEY DOCKET No. UNDO 11 
Client/Matter No. 83208.0014 

SYSTEM AND METHOD FOR DISTRIBUTED 
MANAGEMENT OF DATA STORAGE 

BACKGROUND OF THE INVENTION 

1. Cross Reference to Related Patent Applications 

The present invention claims priority from United States 
Provisional Patent Application Serial No. 60/183,762 for: 
"System and Method for Decentralized Data Storage" filed 
February 18, 2000, and United States Provisional Patent 
Application Serial No. 60/245,920 filed November 6, 2000 
entitled "System and Method for Decentralized Data Storage", 
the disclosures of which are herein specifically 
incorporated by this reference. 

2. Field of the Invention. 

The present invention relates, in general, to network 
data storage, and, more particularly, to software, systems 
and methods for distributed allocation and management of a 
storage network infrastructure. 

3. Relevant Background. 

Economic, political, and social power are increasingly 
managed by data. Transactions and wealth are represented by 
data. Political power is analyzed and modified based on 
data. Human interactions and relationships are defined by 
data exchanges. Hence, the efficient distribution, storage, 
and management of data is expected to play an increasingly 
vital role in human society. 

The quantity of data that must be managed, in the 
form of computer programs, databases, files, and the like, 
increases exponentially. As computer processing power 
increases, operating system and application software 
becomes larger. Moreover, the desire to access larger 
data sets such as data sets comprising multimedia files 
and large databases further increases the quantity of data 
that is managed. This increasingly large data load must 
be transported between computing devices and stored in an 
accessible fashion. The exponential growth rate of data 
is expected to outpace improvements in communication 
bandwidth and storage capacity, making the need to handle 
data management tasks using conventional methods even more 
urgent. 

Data comes in many varieties and flavors. 
Characteristics of data include, for example, the 
frequency of read access, frequency of write access, 
average size of each access request, permissible latency, 
permissible availability, desired reliability, security, 
and the like. Some data is accessed frequently, yet 
rarely changed. Other data is frequently changed and 
requires low latency access. These characteristics should 
affect the manner in which data is stored. 

Many factors must be balanced and often compromised 
in the operation of conventional data storage systems. 
Because the quantity of data stored is large and rapidly 
increasing, there is continuing pressure to reduce cost 
per bit of storage. Also, data management systems should 
be sufficiently scaleable to contemplate not only current 
needs, but future needs as well. Preferably, storage 
systems are designed to be incrementally scaleable so that 
a user can purchase only the capacity needed at any 
particular time. High reliability and high availability 
are also considered desirable as data users become 
increasingly intolerant of lost, damaged, and unavailable 
data. Unfortunately, conventional data management 
architectures must compromise these factors: no single 
data architecture provides a cost-effective, highly 
reliable, highly available, and dynamically scaleable 
solution. Conventional RAID (redundant array of 
independent disks) systems provide a way to store the same 
data in different places (thus, redundantly) on multiple 
storage devices such as hard disks. By placing data on 
multiple disks, input/output (I/O) operations can overlap 
in a balanced way, improving performance. Since using 
multiple disks increases the mean time between failure 
(MTBF) for the system as a whole, storing data redundantly 
also increases fault-tolerance. A RAID system relies on a 
hardware or software controller to hide the complexities 
of the actual data management so that a RAID system 
appears to an operating system to be a single logical hard 
disk. However, RAID systems are difficult to scale 
because of physical limitations on the cabling and 
controllers. Also, RAID systems are highly dependent on 
the controllers so that when a controller fails, the data 
stored behind the controller becomes unavailable. 
Moreover, RAID systems require specialized, rather than 
commodity hardware, and so tend to be expensive solutions. 

RAID solutions are also relatively expensive to 
maintain. RAID systems are designed to enable recreation 
of data on a failed disk or controller but the failed disk 
must be replaced to restore high availability and high 
reliability functionality. Until replacement occurs, the 
system is vulnerable to additional device failures. 
Condition of the system hardware must be continually 
monitored and maintenance performed as needed to maintain 
functionality. Hence, RAID systems must be physically 
situated so that they are accessible to trained 
technicians who can perform the maintenance. This 
limitation makes it difficult to set up a RAID system at a 
remote location or in a foreign country where suitable 
technicians would have to be found and/or transported to 
the RAID equipment to perform maintenance functions. 

While RAID systems address the allocation and 
management of data within storage devices, other issues 
surround methods for connecting storage to computing 
platforms. Several methods exist including: Direct 
Attached Storage (DAS), Network Attached Storage (NAS), 
and Storage Area Networks (SAN). Currently, the vast 
majority of data storage devices such as disk drives, disk 
arrays and RAID systems are directly attached to a client 
computer through various adapters with standardized 
software protocols such as EIDE, SCSI, Fibre Channel and 
others. 

NAS and SAN refer to data storage devices that are 
accessible through a network rather than being directly 
attached to a computing device. A client computer 
accesses the NAS/SAN through a network and requests are 
mapped to the NAS/SAN physical device or devices. NAS/SAN 
devices may perform I/O operations using RAID internally 
(i.e., within a NAS/SAN node). NAS/SAN may also automate 
mirroring of data to one or more other devices at the same 
node to further improve fault tolerance. Because NAS/SAN 
mechanisms allow for adding storage media within specified 
bounds and can be added to a network, they may enable some 
scaling of the capacity of the storage systems by adding 
additional nodes. However, NAS/SAN devices themselves 
implement DAS to access their storage media and so are 
constrained in RAID applications to the abilities of 
conventional RAID controllers. NAS/SAN systems do not 
enable mirroring and parity across nodes, and so a single 
point of failure at a typical NAS/SAN node makes all of 
the data stored at that node unavailable. 

Because NAS and SAN solutions are highly dependent on 
network availability, the NAS devices are preferably 
implemented on high-speed, highly reliable networks using 
costly interconnect technology such as Fibre Channel. 
However, the most widely available and geographically 
distributed network, the Internet, is inherently 
unreliable and so has been viewed as a sub-optimal choice 
for NAS and SAN implementation. Hence, a need exists for 
a storage management system that enables a large number of 
unreliably connected, independent servers to function as a 
reliable whole. 

In general, current storage methodologies have 
limited scalability and/or present too much complexity to 
devices that use the storage. Important functions of a 
storage management mechanism include communicating with 
physical storage devices, allocating and deallocating 
capacity within the physical storage devices, and managing 
read/write communication between the devices that use the 
storage and the physical storage devices. Storage 
management may also include more complex functionality 
including mirroring and parity operations. 

In a conventional personal computer, for example, the 
storage subsystem comprises one or more hard disk drives 
and a disk controller comprising drive control logic for 
implementing an interface to the hard drives. In RAID 
systems, multiple hard disk drives are used, and the 
control logic implements the mirroring and parity 
operations that are characteristic of RAID mechanisms. 
The control logic implements the storage management 
functions and presents the user with an interface that 
preferably hides the complexity of the underlying physical 
storage devices and control logic. 

As currently implemented, storage management 
functions are highly constrained by, for example, the 
physical limitations of the connections available between 
physical storage devices. These physical limitations 
regulate the number and diversity of physical storage 
devices that can be combined to implement particular 
storage needs. For example, a single RAID controller 
cannot manage and store a data set across different 
buildings because the controller cannot connect to storage 
devices that are separated by such distance. Similarly, a 
hard disk controller or RAID controller has a limited 
number of devices that it can connect to. What is needed 
is a storage management system that supports an 
arbitrarily large number of physical devices that may be 
separated from each other by arbitrarily large distances. 

Another significant limitation of current storage 
management implementation is that the functionality is 
implemented in some centralized entity (e.g., the control 
logic), that receives requests from all users and 
implements the requests in the physical storage devices. 
Even where data is protected by mirroring or parity, 
failure of any portion of the centralized functionality 
affects availability of all data stored behind those 
devices. 

Further, current storage management systems and 
methods are inherently static or are at best configurable 
within very limited bounds. A storage management system 
is configured at startup to provide a specified level of 
reliability, specified recovery rates, a specified and 
generally limited addressable storage capacity, and a 
restricted set of user devices from which storage tasks 
can be accepted. As needs change, however, it is often 
desirable to alter some or all of these characteristics. 
Even when the storage system can be reconfigured, such 
reconfiguration usually involves making the stored data 
unavailable for some time while new storage capacity is 
allocated and the data is migrated to the newly allocated 
storage capacity. 

SUMMARY OF THE INVENTION 

Briefly stated, the present invention involves a data 
storage system that implements storage management 
functionality in a distributed manner. Preferably, the 
storage management system comprises a plurality of 
instances of storage management processes where the 
instances are physically distributed such that failure or 
unavailability of any given instance or set of instances 
will not impact the availability of stored data. 

The storage management functions operate in combination 
with one or more networked devices that are capable of 
storing data to provide what is referred to herein as a 
"storage substrate". The storage management process 
instances communicate with each other to store data in a 
distributed, collaborative fashion with no centralized 
control of the system. 

In a particular implementation, the present invention 
involves systems and methods for distributing data with 
parity (e.g., redundancy) over a large geographic and 
topological area in a network architecture. Data is 
transported to, from, and between nodes using network 
connections rather than bus connections. The network data 
distribution relaxes or removes limitations on the number 
of storage devices and the maximum physical separation 
between storage devices that limited prior fault-tolerant 
data storage systems and methods. The present invention 
allows data storage to be distributed over larger areas 
(e.g., the entire world), thereby mitigating outages from 
localized problems such as network failures, power 
failures, as well as natural and man-made disasters. 

BRIEF DESCRIPTION OF THE DRAWINGS 

FIG. 1 illustrates a globally distributed storage 
network in accordance with an embodiment of the present 
invention; 

FIG. 2 shows a networked computer environment in 
which the present invention is implemented; 

FIG. 3 illustrates components of a RAIN element in 
accordance with an embodiment of the present invention; 

FIG. 4 shows in block diagram form process 
relationships in a system in accordance with the present 
invention; 

FIG. 5 illustrates in block diagram form functional 
entities and relationships in accordance with an 
embodiment of the present invention; 

FIG. 6 shows an exemplary set of component processes 
within a storage allocation management process of the 
present invention; and 

FIGs. 7A-7F illustrate an exemplary set of protection 
levels that can be provided in accordance with the systems 
and methods of the present invention. 

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The present invention is directed to a high 
availability, high reliability storage system that 
leverages rapid advances in commodity computing devices 
and the robust nature of internetwork technology such as 
the Internet. In general, the present invention involves 
a redundant array of inexpensive nodes (RAIN) distributed 
throughout a network topology. Nodes may be located on 
local area networks (LANs), metropolitan area networks 
(MANs), wide area networks (WANs), or any other network 
having spatially distanced nodes. Nodes are preferably 
internetworked using mechanisms such as the Internet. In 
specific embodiments, at least some nodes are publicly 
accessible through public networks such as the Internet 
and the nodes communicate with each other by way of 
private networks and/or virtual private networks, which 
may themselves be implemented using Internet resources. 

Significantly, the nodes implement not only storage, 
but sufficient intelligence to communicate with each other 
and manage not only their own storage, but storage on 
other nodes. For example, storage nodes maintain state 
information describing other storage nodes' capabilities, 
connectivity, capacity, and the like. Also, storage nodes 
may be enabled to cause storage functions such as 
read/write functions to be performed on other storage 
nodes. Traditional storage systems do not allow 
peer-to-peer type information sharing amongst the storage 
devices themselves. In contrast, the present invention 
enables peer-to-peer information exchange and, as a result, 
implements a significantly more robust system that is 
highly scaleable. The system is scaleable because, among 
other reasons, many storage tasks can be implemented in 
parallel by multiple storage devices. The system is 
robust because the storage nodes can be globally 
distributed, making the system immune to events in any one 
or more geographical, political, or network topological 
locations. 
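
By way of illustration only, the state information that a 
storage node maintains about its peers might be sketched as 
follows. The class names, field names, and selection rule 
are hypothetical and are not taken from this disclosure; the 
sketch merely shows one node keeping a table of peer 
capabilities, connectivity, and capacity and using it to 
find peers able to accept a storage task. 

```python
from dataclasses import dataclass, field


@dataclass
class PeerState:
    """What one storage node believes about another node (hypothetical fields)."""
    address: str            # network address of the peer
    free_bytes: int         # advertised unused capacity
    reachable: bool = True  # last known connectivity
    capabilities: set = field(default_factory=set)  # e.g. {"mirror", "parity"}


class PeerTable:
    """Per-node view of the other storage nodes; every node keeps its own copy."""

    def __init__(self):
        self._peers = {}

    def update(self, state: PeerState) -> None:
        # Newer information about a peer replaces the older entry.
        self._peers[state.address] = state

    def candidates(self, needed_bytes: int, capability: str) -> list:
        """Reachable peers that advertise the capability and enough free space,
        ordered from most to least free capacity."""
        fits = [
            p for p in self._peers.values()
            if p.reachable
            and p.free_bytes >= needed_bytes
            and capability in p.capabilities
        ]
        return sorted(fits, key=lambda p: p.free_bytes, reverse=True)
```

Because each node holds its own table, no central directory 
is consulted when a node selects peers for a read or write. 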

The present invention is implemented in a globally 
distributed storage system involving storage nodes that 
are optionally managed by distributed storage allocation 
management (SAM) processes. The nodes are connected to a 
network and data is preferably distributed to the nodes in 
a multi-level, fault-tolerant fashion. In contrast to 
conventional RAID systems, the present invention enables 
mirroring, parity operations, and divided shared secrets 
to be spread across nodes rather than simply across hard 
drives within a single node. Nodes can be dynamically 
added to and removed from the system while the data 
managed by the system remains available. In this manner, 
the system of the present invention avoids single or 
multiple failure points in a manner that is orders of 
magnitude more robust than conventional RAID systems. 
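
As a minimal sketch of a parity operation spread across 
nodes, the following assumes a single XOR parity block, with 
each block destined for a different node. The function names 
and the zero-padding scheme are assumptions made for 
illustration; the disclosure does not prescribe a particular 
parity code. 

```python
def xor_blocks(blocks):
    """Bytewise XOR of equal-length byte strings."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            out[i] ^= b
    return bytes(out)


def stripe_with_parity(data: bytes, data_nodes: int):
    """Split data into `data_nodes` equal blocks (zero padded) plus one
    parity block. Each returned block is meant for a different node."""
    size = -(-len(data) // data_nodes)  # ceiling division
    padded = data.ljust(size * data_nodes, b"\0")
    blocks = [padded[i * size:(i + 1) * size] for i in range(data_nodes)]
    return blocks + [xor_blocks(blocks)]


def rebuild_missing(blocks):
    """Recover the single block marked None by XOR-ing all surviving blocks."""
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("single parity can rebuild exactly one missing block")
    survivors = [b for b in blocks if b is not None]
    rebuilt = list(blocks)
    rebuilt[missing[0]] = xor_blocks(survivors)
    return rebuilt
```

In this sketch the loss of any one node, whether it holds a 
data block or the parity block, leaves the stored data 
recoverable from the remaining nodes. 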

The present invention is illustrated and described in 
terms of a distributed computing environment such as an 
enterprise computing system using public communication 
channels such as the Internet. However, an important 
feature of the present invention is that it is readily 
scaled upwardly and downwardly to meet the needs of a 
particular application. Accordingly, unless specified to 
the contrary, the present invention is applicable to 
significantly larger, more complex network environments as 
well as small network environments such as those typified 
by conventional LAN systems. 

The present invention is directed to data storage on 
a network 101 shown in FIG. 1. FIG. 1 shows an exemplary 
internetwork environment 101 such as the Internet. The 
Internet is a global internetwork formed by logical and 
physical connections between multiple wide area networks 
(WANs) 103 and local area networks (LANs) 104. An 
Internet backbone 102 represents the main lines and 
routers that carry the bulk of the traffic. The backbone 
is formed by the largest networks in the system that are 
operated by major Internet Service Providers (ISPs) such 
as GTE, MCI, Sprint, UUNet, and America Online, for 
example. While single connection lines are used to 
conveniently illustrate WAN 103 and LAN 104 connections to 
the Internet backbone 102, it should be understood that in 
reality multi-path, routable wired and/or wireless 
connections exist between multiple WANs 103 and LANs 104. 
This makes internetwork 101 robust when faced with single 
or multiple failure points. 

It is important to distinguish network connections 
from internal data pathways implemented between peripheral 
devices within a computer. A "network" comprises a system 
of general purpose, usually switched, physical connections 
that enable logical connections between processes 
operating on nodes 105. The physical connections 
implemented by a network are typically independent of the 
logical connections that are established between processes 
using the network. In this manner, a heterogeneous set of 
processes ranging from file transfer, mail transfer, and 
the like can use the same physical network. Conversely, 
the network can be formed from a heterogeneous set of 
physical network technologies that are invisible to the 
logically connected processes using the network. Because 
the logical connection between processes implemented by a 
network is independent of the physical connection, 
internetworks are readily scaled to a virtually unlimited 
number of nodes over long distances. 

In contrast, internal data pathways such as a system 
bus, Peripheral Component Interconnect (PCI) bus, 
Intelligent Drive Electronics (IDE) bus, Small Computer 
System Interface (SCSI) bus, Fibre Channel, and the like 
define physical connections that implement special-purpose 
connections within a computer system. These connections 
implement physical connections between physical devices as 
opposed to logical connections between processes. These 
physical connections are characterized by limited distance 
between components, limited number of devices that can be 
coupled to the connection, and constrained format of 
devices that can communicate over the connection. 

To generalize the above discussion, the term 
"network" as it is used herein refers to a means enabling 
a physical and logical connection between devices that 1) 
enables at least some of the devices to communicate with 
external sources, and 2) enables the devices to 
communicate with each other. It is contemplated that some 
of the internal data pathways described above could be 
modified to implement the peer-to-peer style communication 
of the present invention; however, such functionality is 
not currently available in commodity components. 
Moreover, such modification, while useful, would fail to 
realize the full potential of the present invention as 
storage nodes implemented across, for example, a SCSI bus 
would inherently lack the level of physical and 
topological diversity that can be achieved with the 
present invention. 

Referring again to FIG. 1, the present invention is 
implemented by providing a plurality of storage 
management mechanisms 106 controlling a plurality of 
storage devices at nodes 105. For ease of understanding, 
mechanisms 106 are illustrated as entities distinct from 
nodes 105. In preferred implementations, however, 
storage nodes 105 and storage management mechanisms 106 
are merged in the sense that both are implemented at each 
node 105/106. It is nonetheless contemplated that they may 
be implemented in distinct network nodes as literally 
shown in FIG. 1. 

The storage at any node 105 may comprise a single 
hard drive, may comprise a managed storage system such as 
a conventional RAID device having multiple hard drives 
configured as a single logical volume, or may comprise any 
reasonable hardware configuration spanned by these 
possibilities. Significantly, the present invention 
manages redundancy operations across nodes, as opposed to 
within nodes, so that the specific configuration of the 
storage within any given node can be varied significantly 
without departing from the present invention. 

Optionally, one or more nodes such as nodes 106 
implement storage allocation management (SAM) processes 
that manage data storage across multiple nodes 105 in a 
distributed, collaborative fashion. SAM processes may be 
implemented in a centralized fashion within 
special-purpose nodes 106. Alternatively, SAM processes are 
implemented within some or all of the RAIN nodes 105. The 
SAM processes communicate with each other and handle 
access to the actual storage devices within any particular 
RAIN node 105. The capabilities, distribution, and 
connections provided by the RAIN nodes 105 in accordance 
with the present invention enable storage processes (e.g., 
SAM processes) to operate with little or no centralized 
control for the system as a whole. 

In a particular implementation, SAM processes provide 
data distribution across nodes 105 and implement recovery 
in a fault-tolerant fashion across network nodes 105 in a 
manner similar to paradigms found in RAID storage 
subsystems. However, because SAM processes operate across 
nodes rather than within a single node or within a single 
computer, they allow for greater levels of fault tolerance 
and storage efficiency than those that may be achieved 
using conventional RAID systems. Moreover, it is not 
simply that the SAM processes operate across network 
nodes, but also that SAM processes are themselves 
distributed in a highly parallel and redundant manner, 
especially when implemented within some or all of the 
nodes 105. By way of this distribution of functionality 
as well as data, failure of any node or group of nodes 
will be much less likely to affect the overall 
availability of stored data. 

For example, SAM processes can recover even when a 
network node 105, LAN 104, or WAN 103 becomes unavailable. 
Moreover, even when a portion of the Internet backbone 102 
becomes unavailable through failure or congestion the SAM 
processes can recover using data distributed on nodes 105 
and functionality that is distributed on the various SAM 
nodes 106 that remain accessible. In this manner, the 
present invention leverages the robust nature of 
internetworks to provide unprecedented availability, 
reliability, and robustness. 
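
One way to picture why data remains recoverable when a LAN 
or WAN becomes unavailable is a placement rule that spreads 
copies across different networks. The following is a sketch 
of one hypothetical policy, not a policy stated in this 
disclosure; the tuple layout, the identifiers, and the 
scoring rule are all assumptions, and LAN identifiers are 
assumed to be globally unique. 

```python
def place_replicas(nodes, copies):
    """Pick `copies` nodes, preferring nodes whose WAN and LAN differ from
    those already chosen. `nodes` is a list of (node_id, wan_id, lan_id)."""
    chosen = []
    remaining = list(nodes)
    while remaining and len(chosen) < copies:

        def overlap(candidate):
            # Count how many already-chosen nodes share this WAN or this LAN.
            _, wan, lan = candidate
            return sum(
                (wan == other_wan) + (lan == other_lan)
                for _, other_wan, other_lan in chosen
            )

        best = min(remaining, key=overlap)  # ties keep the input order
        chosen.append(best)
        remaining.remove(best)
    return [node_id for node_id, _, _ in chosen]


def survivors(placement, nodes, failed_wans=(), failed_lans=()):
    """Replica holders still reachable after the named networks fail."""
    where = {node_id: (wan, lan) for node_id, wan, lan in nodes}
    return [
        n for n in placement
        if where[n][0] not in failed_wans and where[n][1] not in failed_lans
    ]
```

With copies placed this way, the failure of any single LAN 
or WAN in the example topology still leaves at least one 
reachable copy from which recovery can proceed. 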

FIG. 2 shows an alternate view of an exemplary 
network computing environment in which the present 
invention is implemented. Internetwork 101 enables the 
interconnection of a heterogeneous set of computing 
devices and mechanisms ranging from a supercomputer or 
data center 201 to a hand-held or pen-based device 206. 
While such devices have disparate data storage needs, they 
share an ability to retrieve data via network 101 and 
operate on that data using their own resources. Disparate 
computing devices including mainframe computers (e.g., VAX 
station 202 and IBM AS/[...]) as [...] IBM compatible 
device 203 [...] computer 205 are easily interconnected via 
internetwork 101. The present invention [...] wireless 
device connections to devices [...] computers, pagers, 
[...]. 

Internet-based network 213 comprises a set of logical 
connections, some of which are made through internetwork 
101, between a plurality of internal networks 214. 
Conceptually, Internet-based network 213 is akin to a WAN 
103 in that it enables logical connections between 
spatially distant nodes. Internet-based network 213 may 
be implemented using the Internet or other public and 
private WAN technologies including leased lines, Fibre 
Channel, frame relay, and the like. 

Similarly, internal networks 214 are conceptually 
akin to LANs 104 shown in FIG. 1 in that they enable 
logical connections across more limited distances than 
those allowed by a WAN 103. Internal networks 214 may be 
implemented using LAN technologies including Ethernet, 
Fiber Distributed Data Interface (FDDI), Token Ring, 
AppleTalk, Fibre Channel, and the like. 

Each internal network 214 connects one or more RAIN 
elements 215 to implement RAIN nodes 105. RAIN elements 
215 illustrate [...] hardware/software platform that 
implements a RAIN node 105. Conversely, a RAIN node 105 
refers to a more abstract logical entity that illustrates 
the presence of the RAIN functionality to external network 
users. Each RAIN element 215 comprises a processor, 
memory, and one or more mass storage devices such as hard 
disks. RAIN elements 215 also include hard disk 
controllers that may be conventional EIDE or SCSI 
controllers, or may be managing controllers such as RAID 
controllers. RAIN elements 215 may be dispersed or 
co-located in one or more racks sharing resources such as 
cooling and power. Each node 105 is independent of other 
nodes 105 in that failure of one node 105 does not affect 
availability of other nodes 105, and data stored on one 
node 105 may be reconstructed from data stored on other 
nodes 105. 



The perspective provided by FIG. 2 is highly physical 
and it should be kept in mind that physical implementation 
of the present invention may take a variety of forms. The 
multi-tiered network structure of FIG. 2 may be altered to 
a single tier in which all RAIN nodes 105 communicate 
directly with the Internet. Alternatively, three or more 
network tiers may be present with RAIN nodes 105 clustered 
behind any given tier. A feature of the present invention 
is that it is adaptable to these heterogeneous 
implementations. 

RAIN elements 215 are shown in greater detail in FIG. 
3. In a particular implementation, RAIN elements 215 
comprise computers using commodity components such as 
Intel-based microprocessors mounted on a motherboard 
supporting a PCI bus 303 and random access memory (RAM) 
302 housed in a conventional AT or ATX case. SCSI or IDE 
controllers may be implemented on the motherboard and/or 
by expansion cards connected to the PCI bus 303. Where 
the controllers are implemented only on the motherboard, a 
PCI expansion bus 303 is optional. In a particular 
implementation, the motherboard implements two mastering 
EIDE channels and a PCI expansion card is used to 
implement two additional mastering EIDE channels, so that 
each RAIN element 215 includes up to four EIDE hard disks 
307, each with a dedicated EIDE channel. In the particular 
implementation, each hard disk 307 [...] per RAIN element 
215. The casing also houses supporting mechanisms such as 
power supplies and cooling devices (not shown). 

The specific implementation discussed above is 
readily modified to meet the needs of a particular 
application. Because the present invention uses network 
methods to communicate with the storage nodes, the 
particular implementation of the storage node is largely 
hidden from the devices using the storage nodes, making 
the present invention uniquely [...] of node configuration 
and highly tolerant of systems comprised by heterogeneous 
storage node configurations. For example, processor type, 
speed, instruction set architecture, and the like can be 
modified and may vary from node to node. The hard disk 
capacity and configuration within RAIN elements 215 can be 
readily increased or decreased to meet the needs of a 
particular application. Although mass storage is 
implemented using magnetic hard disks, other types of mass 
storage devices such as [...] optical [...] tape, 
holographic storage, atomic force probe storage and the 
like can be used as suitable equivalents as they become 
increasingly available. Memory configurations including 
RAM capacity, RAM speed, and RAM type (e.g., DRAM, SRAM, 
SDRAM) can vary from node to node, making the present 
invention incrementally upgradeable to take advantage of 
new technologies [...]. Network interface components may be 
provided in the form of expansion cards coupled to a 
mother board or built into a mother board and may operate 
with a variety of available interface speeds 
(e.g., 10 BaseT Ethernet, 100 BaseT Ethernet, Gigabit 
Ethernet, 56K analog modem) and can provide varying levels 
of buffering, protocol stack processing, and the like. 

RAIN elements 215 desirably implement a "heartbeat" 
process that informs other RAIN nodes or storage 
management processes of their existence and their state of 
operation. For example, when a RAIN node 105 is attached 
to a network 213 or 214, the heartbeat message indicates 
that the RAIN element 215 is available, and reports its 
available storage. The RAIN element 215 can report 
disk failures that require parity operations. Loss of the 
heartbeat for a predetermined length of time may result in 
reconstruction of an entire node at an alternate node or, 
in a preferable implementation, reconstruction of the data 
on the lost node on a plurality of pre-existing nodes 
elsewhere in the system. In a particular implementation, 
the heartbeat message is unicast to a single management 
node, or multicast or broadcast to a plurality of 
management nodes periodically or intermittently. The 
broadcast may be scheduled at regular or irregular 
intervals, or may occur on a pseudorandom schedule. The 
heartbeat message includes information such as the network 
address of the associated RAIN node 105, storage capacity, 
state information, maintenance information and the like. 
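
The heartbeat behavior described above can be sketched as 
follows. This is a minimal illustration under stated 
assumptions: the message fields mirror the information 
listed above, but the class names, field names, state 
strings, and the explicit `now` argument (used in place of a 
real clock so the logic is easy to follow) are hypothetical. 

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    """Fields follow the description above; the names are hypothetical."""
    address: str         # network address of the RAIN node
    capacity_bytes: int  # available storage reported by the node
    state: str           # e.g. "ok" or "disk-failed"


class HeartbeatMonitor:
    """Tracks the most recent heartbeat received from each node."""

    def __init__(self, timeout_seconds: float):
        self.timeout = timeout_seconds
        self._last_seen = {}  # address -> (time received, last Heartbeat)

    def receive(self, hb: Heartbeat, now: float) -> None:
        self._last_seen[hb.address] = (now, hb)

    def lost_nodes(self, now: float) -> list:
        """Nodes whose heartbeat has been silent longer than the timeout."""
        return sorted(
            addr for addr, (seen, _) in self._last_seen.items()
            if now - seen > self.timeout
        )

    def needs_parity_repair(self) -> list:
        """Nodes whose most recent heartbeat reported a disk failure."""
        return sorted(
            addr for addr, (_, hb) in self._last_seen.items()
            if hb.state == "disk-failed"
        )
```

A management process could poll `lost_nodes` and, for each 
address returned, begin reconstructing that node's data on 
other nodes; how the reconstruction targets are chosen is 
outside the scope of this sketch. 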

Specifically, it is contemplated that the processing 
power, memory, network connectivity and other features of 
the implementation shown in FIG. 3 could be integrated 
within a disk drive controller and actually integrated 
within the housing of a disk drive itself. In such a 
configuration, a RAIN element 215 might be deployed simply 
by connecting such an integrated device to an available 
network, and multiple RAIN elements 215 might be housed in 
a single physical enclosure. 

Each RAIN element 215 may execute an operating 
system. The particular implementations use a UNIX 
operating system (OS) or UNIX-variant OS such as Linux. 
It is contemplated, however, that other operating systems 
including DOS, Microsoft Windows, Apple Macintosh OS, 
OS/2, Microsoft Windows NT and the like may be 
equivalently substituted with predictable changes in 
performance. Moreover, special purpose lightweight 
operating systems or micro kernels may also be used, 
although the cost of development of such operating systems 
may be prohibitive. The operating system chosen 
implements a platform for executing application software 
and processes, mechanisms for accessing a network, and 
mechanisms for accessing mass storage. Optionally, the 
OS supports a storage allocation system for the mass 
storage via the hard disk controller(s). 

Various application software and processes can be 
implemented on each RAIN element 215 to provide network 
connectivity via a network interface 304 using appropriate 
network protocols such as User Datagram Protocol (UDP), 
Transmission Control Protocol (TCP), Internet Protocol 
(IP), Token Ring, Asynchronous Transfer Mode (ATM), and 
the like. 

In the particular embodiments, the data stored in any 
particular node 105 can be recovered using data at one or 
more other nodes 105 by means of data recovery and storage 
management processes. These data recovery and storage 
management processes preferably execute on a node 106 
and/or on one or more of the nodes 105 separate from the 
particular node 105 upon which the data is stored. 
Conceptually, storage management is provided across an 
arbitrary set of nodes 105 that may be coupled to 
separate, independent internal networks 214 via 
internetwork 213. This increases availability and 
reliability in that one or more internal networks 214 can 
fail or become unavailable due to congestion or other 
events without affecting the overall availability of data. 

In an elemental form, each RAIN element 215 has some
superficial similarity to a network attached storage (NAS)
device. However, because the RAIN elements 215 work
cooperatively, the functionality of a RAIN system
comprising multiple cooperating RAIN elements 215 is
significantly greater than that of a conventional NAS
device. Further, each RAIN element preferably supports data
structures that enable parity operations across nodes 105
(as opposed to within nodes 105). These data structures
enable operation akin to RAID operation; however, because
the RAIN operations are distributed across nodes and the
nodes are logically, but not necessarily physically,
connected, the RAIN operations are significantly more
fault tolerant and reliable than conventional RAID
systems.
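As a minimal sketch of a parity operation across nodes,
assuming simple exclusive-or (XOR) parity, the following shows
how a block held on one node can be rebuilt from the blocks on
the surviving nodes together with a parity block. The node
names, the three-data-node layout and the xor_blocks helper
are hypothetical and chosen only for illustration.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


# Hypothetical layout: three data nodes plus one parity node.
data_nodes = {
    "node_a": b"RAIN",
    "node_b": b"data",
    "node_c": b"1234",
}
parity_node = xor_blocks(list(data_nodes.values()))

# Simulate losing node_b, then rebuild its block from the survivors.
survivors = [blk for name, blk in data_nodes.items() if name != "node_b"]
rebuilt = xor_blocks(survivors + [parity_node])
```

Because the parity block resides on a different node than the
data it protects, the rebuild does not depend on the failed
node being reachable.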

FIG. 4 shows a conceptual diagram of the relationship
between the distributed storage management processes in
accordance with the present invention. SAM processes 406
represent a collection of distributed instances of SAM
processes 106 referenced in FIG. 1. Similarly, RAIN 405
in FIG. 4 represents a collection of instances of RAIN
nodes 105 referenced in FIG. 1. It should be understood
that RAIN instances 405 and SAM instances 406 are
preferably distributed processes. In other words, the
physical machines that implement these processes may
comprise tens, hundreds, or thousands of machines that
communicate with each other directly or via network(s) 101
to perform storage tasks.

request to a domain name using HyperText Transfer
Protocol (HTTP), Secure HyperText Transfer Protocol
(HTTPS), File Transfer Protocol (FTP), or the like. The
Internet Domain Name System (DNS) will resolve the storage
request to a particular IP address identifying a specific
storage node 215 that implements the SAM processes 401.
Client 503 then directs the actual storage request using a
mutual protocol to the identified IP address.

The storage request is directed using network routing
resources to a storage node 215 assigned to the IP
address. This storage node then conducts storage
operations (i.e., data read and write transactions) on
mass storage devices implemented in the storage node 215,
or on any other storage node 215 that can be reached over
an explicit or virtual private network 501. Some storage
nodes 215 may be clustered as shown in the lower left side
of FIG. 5, and clustered storage nodes may be accessible
through another storage node 215.

Preferably, all storage nodes are enabled to exchange
state information via private network 501. Private
network 501 is implemented as a virtual private network
over Internet 101 in the particular examples. In the
particular examples, each storage node 215 can send and
receive state information. However, it is contemplated
that in some applications some storage nodes 215 may need
only to send their state information while other nodes 215
act to send and receive storage information. The system
state information may be exchanged universally such that
all storage nodes 215 contain a consistent set of state
information about all other storage nodes 215.
Alternatively, some or all storage nodes 215 may only have
information about a subset of storage nodes 215.
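One way such an exchange could converge on a consistent set
of state information is sketched here. The per-node version
counter, the merge_state function and the free_gb field are
assumptions introduced for the example; the description does
not prescribe a particular reconciliation rule.

```python
def merge_state(local, received):
    """Merge a peer's state table into ours, keeping the newest entry per node.

    Both tables map node id -> (version, info). 'version' is a hypothetical
    counter that each node increments whenever its own state changes.
    """
    merged = dict(local)
    for node_id, (version, info) in received.items():
        if node_id not in merged or version > merged[node_id][0]:
            merged[node_id] = (version, info)
    return merged


node_1_view = {"n1": (3, {"free_gb": 120}), "n2": (1, {"free_gb": 500})}
node_2_view = {"n2": (2, {"free_gb": 480}), "n3": (1, {"free_gb": 900})}

# After a two-way exchange both nodes hold the same table.
view_1 = merge_state(node_1_view, node_2_view)
view_2 = merge_state(node_2_view, node_1_view)
```

A node that only sends its state would simply never call
merge_state, and a node that tracks a subset of nodes could
filter the received table before merging.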

Another feature of the present invention involves the
installation and maintenance of RAIN systems such as that
shown in FIG. 5. Unlike conventional RAID systems, a RAIN
system enables data to be cast out over multiple,
geographically diverse nodes. RAIN elements and systems
will often be located at great distances from the
technical resources needed to perform maintenance such as
replacing failed controllers or disks. While the
commodity hardware and software at any particular RAIN
node 105 is highly reliable, it is contemplated that
failures will occur.

Using appropriate data protections, data is spread
across multiple RAIN nodes 105 and/or multiple RAIN
systems as described above. In the event of a failure of
one RAIN element 215, RAIN node 105, or RAIN system, high
availability and high reliability functionality can be
restored by accessing an alternate RAIN node 105 or RAIN
system. At one level, this reduces the criticality of a
failure so that it can be addressed days, weeks, or months
after the failure without affecting system performance.
At another level, it is contemplated that failures may
never need to be addressed. In other words, a failed disk
might never be used or repaired. This eliminates the need
to deploy technical resources to distant locations. In
theory, a RAIN node 105 can be set up and allowed to run
for its entire lifetime without maintenance.

FIG. 6 illustrates an exemplary storage allocation
management system including an instance 601 of SAM
processes that provides an exemplary mechanism for
managing storage held in RAIN nodes 105. SAM processes
601 may vary in complexity and implementation to meet the
needs of a particular application. Also, it is not
necessary that all instances 601 be identical, so long as
they share a common protocol to enable interprocess
communication. SAM processes instance 601 may vary in
complexity from relatively simple file system-type
processes to more complex redundant array storage
processes involving multiple RAIN nodes 105. SAM
processes may be implemented within a storage-using
client, within a separate network node 106, or within some
or all of RAIN nodes 105. In a basic form, SAM processes
601 implement a network interface 604 to communicate
with, for example, network 101, together with processes to
exchange state information with other instances 601, to
store the state information in a state information data
structure 603, and to read and write data to storage nodes
105. These basic functions enable a plurality of storage
nodes 105 to coordinate their actions to implement a
virtual storage substrate layer upon which more complex SAM
processes 601 can be implemented.
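The basic form described above can be pictured with the
following sketch, offered under stated assumptions rather than
as the described implementation. The StorageNode and
SamInstance classes, the in-memory dictionaries standing in
for mass storage, and the "most free space" placement rule are
all hypothetical; network interface 604 is omitted so that the
example stays self-contained.

```python
class StorageNode:
    """Stand-in for a RAIN node 105: an in-memory key/value store."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.blocks = {}


class SamInstance:
    """Minimal SAM instance: a state table plus read/write to storage nodes."""

    def __init__(self, nodes):
        self.nodes = {n.node_id: n for n in nodes}
        self.state = {}          # plays the role of state information 603

    def update_state(self, node_id, info):
        self.state[node_id] = info

    def write(self, key, data):
        # Hypothetical placement policy: choose the node reporting most free space.
        target = max(self.state, key=lambda nid: self.state[nid]["free"])
        self.nodes[target].blocks[key] = data
        self.state[target]["free"] -= len(data)
        return target

    def read(self, key):
        # The caller never needs to know which node holds the data.
        for node in self.nodes.values():
            if key in node.blocks:
                return node.blocks[key]
        raise KeyError(key)


sam = SamInstance([StorageNode("a"), StorageNode("b")])
sam.update_state("a", {"free": 100})
sam.update_state("b", {"free": 500})
placed_on = sam.write("file1", b"hello")
payload = sam.read("file1")
```

The point of the sketch is the separation of concerns: callers
see only read and write, while placement decisions are driven
by the state table.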

In a more complex form, contemplated SAM processes
601 comprise a plurality of SAM processes that provide a
set of functions for managing storage held in multiple
RAIN nodes 105 and are used to coordinate, facilitate, and
manage participating nodes 105 in a collective manner. In
this manner, SAM processes 601 may realize benefits in the
form of greater access speeds, distributed high speed data
processing, increased security, greater storage capacity,
lower storage cost, increased reliability and
availability, decreased administrative costs, and the
like.

In the particular example of FIG. 6, SAM processes
are conveniently implemented as network-connected servers
that receive storage requests from a network-attached file
system. Network interface processes 604 may implement a
first interface for receiving storage requests from a

storage capacity, migration of data between storage nodes
105, redundancy synchronization between redundant data
copies, and the like. SAM processes 601 preferably
abstract or hide the underlying configuration, location,
cost, and other context information of each RAIN node 105
from data users. SAM processes 601 also enable a degree
of fault tolerance that is greater than that of any storage
node in isolation as parity is spread out in a configurable
manner across multiple storage nodes that are
geographically, politically, and network topologically
dispersed.

In one embodiment, the SAM processes 601 define
multiple levels of RAID-like fault tolerant performance
across nodes 105 in addition to fault tolerant
functionality within nodes, including:

Level 0 RAIN, where data is striped across multiple
nodes, without redundancy;

Level 1 RAIN, where data is mirrored between or among
nodes;

Level 2 RAIN, where parity data for the system is
stored in a single node;

Level 3 RAIN, where parity data for the system is
distributed across multiple nodes;

Level 4 RAIN, where parity is distributed across
multiple RAIN systems and where parity data is mirrored
between systems;

Level 5 RAIN, where parity is distributed across
multiple RAIN systems and where parity data for the
multiple systems is stored in a single RAIN system; and

(ECC) is used to protect against failure of one or more of
the devices. In the example of FIG. 7C, data element A is
broken into multiple stripes (e.g., stripes A0 and A1 in
FIG. 7B) and each stripe is written to an independent
node. In a particular example, four stripes and hence
four independent nodes 105 are used, although any number
of stripes may be used to meet the needs of a particular
application.
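The striping of a data element across independent nodes can
be sketched as follows, using four nodes as in the particular
example. The function names, the contiguous near-equal split
and the dictionaries standing in for node storage are
assumptions for illustration; no ECC or parity is computed in
this sketch.

```python
def stripe(data: bytes, n_stripes: int):
    """Split data into n_stripes contiguous, near-equal pieces."""
    size = -(-len(data) // n_stripes)   # ceiling division
    return [data[i * size:(i + 1) * size] for i in range(n_stripes)]


def write_striped(data: bytes, nodes: dict, key: str):
    """Write one stripe per node; 'nodes' maps node id -> dict used as storage."""
    for store, piece in zip(nodes.values(), stripe(data, len(nodes))):
        store[key] = piece


def read_striped(nodes: dict, key: str) -> bytes:
    """Reassemble the element by concatenating stripes in node order."""
    return b"".join(store[key] for store in nodes.values())


nodes = {f"node_{i}": {} for i in range(4)}     # four independent nodes
write_striped(b"data element A, striped", nodes, "A")
restored = read_striped(nodes, "A")
```

In a real deployment the per-node writes could be issued in
parallel, which is where the speed advantage of striping
comes from.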

Striping offers a speed advantage in that smaller
writes to multiple nodes can often be accomplished in
parallel faster than a larger write to a single node.
Level 2 RAIN is more efficient in terms of disk space and
write speed than is a level 1 RAIN implementation, and
provides data protection in that data from an unavailable
node can be reconstructed from the ECC data. However,
level 2 RAIN requires the computation and storage of ECC
information (e.g., ECC/Ax-ECC/Az in FIG. 7C) corresponding
to the data element (A) for every write. The ECC
information is used to reconstruct data from one or more
failed or otherwise unavailable nodes. The ECC
information is stored on an independent element 215, and
so can be accessed even when one of the other nodes 215
becomes unavailable.

FIG. 7D illustrates a RAIN Level 3/4 configuration in
which data is striped, and parity information is used to
protect the data rather than ECC. Level 4 RAIN differs
from Level 3 RAIN essentially in that Level 4 RAIN sizes
each stripe to hold a complete block of data such that the
data block (i.e., the typical size of I/O data) does not
have to be subdivided. SAM processes 601 provide for
parity generation, typically by performing an exclusive-or 
(XOR) operation on data as it is added to a stripe and the 
results of the XOR operation stored in the parity stripe - 